Skip to content

feat(cloud-agent): classify retryable runtime failures#3825

Merged
eshurakov merged 1 commit into
mainfrom
boiled-timbale
Jun 12, 2026
Merged

feat(cloud-agent): classify retryable runtime failures#3825
eshurakov merged 1 commit into
mainfrom
boiled-timbale

Conversation

@eshurakov

@eshurakov eshurakov commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add a shared { code, message, retryable } client-error contract across Cloud Agent tRPC responses, terminal callbacks, and message-result polling.
  • Treat agent, provider, wrapper, network, and timeout failures as retryable regardless of lifecycle stage, while keeping deterministic model, payment, input, metadata, and user-interruption failures non-retryable.
  • Preserve typed payment_required and model_missing failures through wrapper ingest and expose optional failureStage so callback clients can choose their recovery strategy.
  • Keep Code Review, Security Auto Analysis, Auto Fix, bot, and webhook consumer policies unchanged; the new callback fields remain additive and can be adopted separately.

Verification

  • Manual verification not performed; this is a backend-only contract and classification change.

Visual Changes

N/A

Reviewer Notes

  • retryable: true permits a new logical attempt with a new messageId; it does not automatically replay work or make a terminal message ID reusable.
  • Failures after agent_activity may have partial side effects. The callback's failureStage lets clients choose between replaying, continuing, inspecting state, or requesting user input.
  • Database telemetry keeps its existing stage/code vocabulary: typed payment/model failures are normalized only for persistence reports, while callbacks and message results retain the exact classification.

@kilo-code-bot

kilo-code-bot Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

Full review of the squashed single-commit PR (02703cfbb). The PR correctly classifies payment_required and model_missing as structured failure codes throughout the stack — from wrapper event detection through DO state storage, callback payload, queue reports, and message-result polling — with no logic errors, security concerns, or runtime hazards found.

Key Design Points (verified correct)
  • getTerminalFailure() in connection.ts cleanly replaces isTerminalError(), returning a typed WrapperTerminalFailure with optional code. usage_limit_exceeded fires without a code (intentional — no structured classification), while payment_required/insufficient_funds emit code: 'payment_required'.
  • session.error guards are correct and consistent with existing tests: both guards require sessionID matching for session.error events.
  • persistedFailureCode() in queue-reports.ts maps payment_required/model_missing to DB-safe codes (assistant_error, wrapper_error_before_activity) for telemetry persistence. The fallback return state.failureCode is unreachable in practice since these codes only arrive via wrapper terminal events at agent_activity or post_dispatch_no_activity stages.
  • FAILED_RUN_DIAGNOSTIC_MESSAGES does not include payment_required, so the remapping to assistant_error ensures the diagnostic IS emitted (rather than silently skipped). isKnownInsufficientCreditFailure then overrides to the credit-specific message when the error text matches.
  • PublicErrorCodeSchema regex (/^[A-Z][A-Z0-9_]*$/) is satisfied by all failureCode.toUpperCase() outputs (e.g. PAYMENT_REQUIRED, MODEL_MISSING, EXECUTION_FAILED, EXECUTION_INTERRUPTED).
  • isAdmissionFailureRetryable() correctly marks PENDING_QUEUE_FULL and INTERNAL as retryable (not in RETRYABLE_CODES for HTTP mapping purposes but explicitly added as retryable for cause semantics).
  • createClientError() in trpc-error.ts enforces NON_RETRYABLE_CODES unconditionally, so an inconsistent retryable: true cause on a BAD_REQUEST is safely overridden.
  • sanitizePublicEventData() strips failureCode from the broadcast event — correct, as it is internal state not meant for public exposure.
  • Consumer schemas (webhook-agent-ingest, security-auto-analysis, code-review-status route) were previously updated to remove clientError/failureStage parsing; this PR adds them back as additive optional fields on the sender side without breaking those consumers.
  • All previously flagged issues from earlier review rounds are resolved in this squashed commit.
Files Reviewed (35 files)
  • packages/db/src/schema.ts
  • packages/worker-utils/package.json
  • packages/worker-utils/src/client-error.ts
  • packages/worker-utils/src/client-error.test.ts
  • packages/worker-utils/src/cloud-agent-failure.ts
  • services/cloud-agent-next/src/callbacks/types.ts
  • services/cloud-agent-next/src/middleware/auth.test.ts
  • services/cloud-agent-next/src/middleware/balance.test.ts
  • services/cloud-agent-next/src/persistence/CloudAgentSession.ts
  • services/cloud-agent-next/src/router/auth.test.ts
  • services/cloud-agent-next/src/router/auth.ts
  • services/cloud-agent-next/src/router/schemas.test.ts
  • services/cloud-agent-next/src/router/schemas.ts
  • services/cloud-agent-next/src/session/message-result.test.ts
  • services/cloud-agent-next/src/session/message-result.ts
  • services/cloud-agent-next/src/session/message-settlement-outbox.test.ts
  • services/cloud-agent-next/src/session/message-settlement-outbox.ts
  • services/cloud-agent-next/src/session/queue-message.test.ts
  • services/cloud-agent-next/src/session/queue-message.ts
  • services/cloud-agent-next/src/session/safe-failure-projection.ts
  • services/cloud-agent-next/src/session/terminal-error-projector.test.ts
  • services/cloud-agent-next/src/session/terminal-error-projector.ts
  • services/cloud-agent-next/src/session/wrapper-supervisor.test.ts
  • services/cloud-agent-next/src/session/wrapper-supervisor.ts
  • services/cloud-agent-next/src/shared/protocol.ts
  • services/cloud-agent-next/src/telemetry/queue-reports.test.ts
  • services/cloud-agent-next/src/telemetry/queue-reports.ts
  • services/cloud-agent-next/src/trpc-error.test.ts
  • services/cloud-agent-next/src/trpc-error.ts
  • services/cloud-agent-next/src/websocket/ingest.test.ts
  • services/cloud-agent-next/src/websocket/ingest.ts
  • services/cloud-agent-next/test/integration/session/legacy-callback-enqueue.test.ts
  • services/cloud-agent-next/test/unit/wrapper/reconnection.test.ts
  • services/cloud-agent-next/wrapper/src/connection.ts
  • services/cloud-agent-next/wrapper/src/main.ts

Reviewed by claude-4.6-sonnet-20260217 · 4,117,451 tokens

Review guidance: REVIEW.md from base branch main

Comment thread services/cloud-agent-next/src/trpc-error.ts Outdated
@eshurakov eshurakov changed the title feat(cloud-agent): expose structured retryable errors feat(cloud-agent): classify retryable runtime failures Jun 9, 2026
Comment thread apps/web/src/app/api/internal/code-review-status/[reviewId]/route.ts Outdated
Comment thread services/security-auto-analysis/src/callbacks.ts Outdated
Comment thread services/cloud-agent-next/src/session/wrapper-supervisor.ts
@eshurakov eshurakov requested a review from alex-alecu June 11, 2026 14:45
Comment thread packages/db/src/schema.ts
| 'assistant_error'
| 'wrapper_error_after_activity'
| 'missing_assistant_reply'
| 'payment_required'

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kilocode-bot Why was this removed? What replaced this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer: 'payment_required' was added, not removed. The + in the diff hunk means this is a new member being inserted into the union — nothing above or below it was removed.

Why it was added

The DB enum is bound to the runtime types via a compile-time contract in session-message-state.ts:

type CodeContractMatchesReport = AssertTrue<
  CloudAgentRunFailureCode extends CloudAgentFailureCode ? true : false
>;

This PR adds payment_required in lockstep to all layers — packages/db/src/schema.ts (here), packages/worker-utils/src/cloud-agent-failure.ts, and safe-failure-projection.ts. If it were added to CLOUD_AGENT_FAILURE_CODES but not to the DB type, the AssertTrue contract would fail to compile. The schema addition is required for the Durable Object to carry a typed failureCode: 'payment_required' at all.

What it does

Previously, insufficient-credits failures were flattened into assistant_error (indistinguishable from any generic error). This PR introduces a typed wrapper signal (WrapperTerminalFailureCodes = ['payment_required', 'model_missing']) so the wrapper can now communicate the precise failure, which terminal-error-projector.ts classifies as non-retryable — feeding the new clientError: { code, message, retryable } contract on callbacks and getMessageResult.

The subtlety

payment_required is intentionally not persisted to the DB column. In telemetry/queue-reports.ts, persistedFailureCode() normalizes it away before writing to failure_code:

function persistedFailureCode(state) {
  if (state.failureCode !== 'model_missing' && state.failureCode !== 'payment_required')
    return state.failureCode;
  if (state.failureStage === 'agent_activity') return 'assistant_error';
  if (state.failureStage === 'post_dispatch_no_activity') return 'wrapper_error_before_activity';
  return state.failureCode;
}

So if the concern is "this enum value will never actually be written to the DB column" — that's accurate and intentional by design. The value exists in CloudAgentSessionRunFailureCode solely to satisfy the shared contract the DO runtime depends on, not because the column is expected to store it.

@eshurakov eshurakov merged commit 168fc02 into main Jun 12, 2026
57 checks passed
@eshurakov eshurakov deleted the boiled-timbale branch June 12, 2026 10:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants